An advanced scRNAseq workflow

overview

Outline

  • Merging multiple datasets: four approaches
  • Droplet processing: three approaches
  • Pseudotime analysis with slingshot
  • CITE-seq data processing

Merging Datasets


Merge multiple datasets

We will discuss several approaches and methods for merging datasets.

  • No corrections
  • Reciprocal Principal Component Analysis (RPCA)
  • Mutual Nearest Neighbors (MNN)
  • Harmony

Example dataset

We will be using the IFNB-stimulated and control PBMCs dataset from the Seurat team. The easiest way to get this data is their custom package SeuratData, which is hosted on GitHub.

We will quickly show you how to get this data, but it isn’t necessary to run these steps.

devtools::install_github("satijalab/seurat-data")
library(Seurat)
library(SeuratData)
InstallData("ifnb")
LoadData("ifnb")
## An object of class Seurat 
## 14053 features across 13999 samples within 1 assay 
## Active assay: RNA (14053 features, 0 variable features)
head(ifnb, 2)
##                   orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono

Example dataset

This dataset loads as a single, already-merged Seurat object, so we need to split it before we can merge it ourselves! The groups are in the “stim” column of the metadata.

table(ifnb$stim)
ifnb_list <- Seurat::SplitObject(ifnb, split.by = "stim")
ifnb_list

Load in your dataset

We have ifnb_list saved in an RData object, which you can load in.

load("data/seuOBJ_IFNB_splitByStim.RData")

Create some functions

We are going to try out a few merging approaches. We want to wrap some of our analysis steps into a function to simplify rerunning things.

Normalization:

  • Log normalization with scale factor = 10,000
  • Find variable features with vst, selecting the top 2000 variable features

data_proc <- function(seu) {
    seu <- NormalizeData(seu, normalization.method = "LogNormalize", scale.factor = 10000)
    seu <- FindVariableFeatures(seu, selection.method = "vst", nfeatures = 2000)
    return(seu)
}

Create some functions

Make clusters:

  • Scale data with ScaleData()
  • Principal Component Analysis using RunPCA() with npcs = 30 PCs
  • Non-linear dimensional reduction with UMAP using RunUMAP() with dims = 1:10
  • Estimate neighbors using FindNeighbors() with dims = 1:10
  • Identify clusters with FindClusters() using resolution = 0.5

quick_clust <- function(seu) {
    set.seed(1001)
    seu <- ScaleData(seu, verbose = FALSE)
    seu <- RunPCA(seu, npcs = 30, verbose = FALSE)
    seu <- RunUMAP(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
    seu <- FindNeighbors(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
    seu <- FindClusters(seu, resolution = 0.5, verbose = FALSE)
    return(seu)
}

Simple Merge


Concatenate the datasets

We can use the merge() function to concatenate our two datasets as they are. We need to provide cell ID prefixes for each group and a name for the project as arguments.

ifnb_merge <- merge(ifnb_list$CTRL, ifnb_list$STIM, add.cell.ids = c("CTRL", "STIM"),
    project = "ifnb_seuMerge")
head(ifnb_merge, 2)
##                        orig.ident nCount_RNA nFeature_RNA stim
## CTRL_AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL
## CTRL_AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL
##                       seurat_annotations
## CTRL_AAACATACATTTCC.1          CD14 Mono
## CTRL_AAACATACCAGAAA.1          CD14 Mono

Process and make clusters

We can use our processing and clustering functions to analyse our merged dataset.

ifnb_merge <- data_proc(ifnb_merge)
ifnb_merge <- quick_clust(ifnb_merge)

UMAP

Our UMAP shows that the cells separate by condition.

DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)

Evaluate with cell types

Cells of a given type should generally cluster together. This pattern indicates the opposite: the same cell types are split into distinct groups depending on the sample.

DimPlot(ifnb_merge, group.by = "seurat_annotations", pt.size = 0.2, split.by = "stim")

STIM/CTRL don’t group

  • This result indicates a substantial difference between the STIM and CTRL groups.
  • Is this a true biological phenomenon, or is it a batch effect?

Merge with reciprocal PCA


Merge with reciprocal PCA

Reciprocal PCA minimizes batch effects while merging different datasets.

This workflow is modified from canonical correlation analysis (CCA), which is widely used to merge batches.

RPCA workflow

  1. Normalize datasets. We can use our data_proc().
  2. Select features for integration using SelectIntegrationFeatures().
  3. Scale data and run PCA using the features identified for integration.
  4. Identify anchors for integration using FindIntegrationAnchors().
  5. Integrate data using IntegrateData().
  6. Run quick_clust() and evaluate the results with a UMAP.

Prepare for RPCA merge

First, we prepare the data for integration. We will normalize the datasets separately. Then, we need to identify features for integration; this is similar to the FindVariableFeatures() step we ran on a single dataset. Lastly, we run scaling and PCA using these features.

ifnb_list <- lapply(ifnb_list, data_proc)

feats <- SelectIntegrationFeatures(ifnb_list)

ifnb_list <- lapply(ifnb_list, function(seu, feats) {
    seu <- ScaleData(seu, features = feats, verbose = FALSE)
    seu <- RunPCA(seu, features = feats, verbose = FALSE)
    return(seu)
}, feats)

Integrating data in RPCA merge

We can then identify anchors. These are the features through which we will integrate our data. Once we have these features, we can then integrate our data sets together.

anchors <- FindIntegrationAnchors(ifnb_list, anchor.features = feats, reduction = "rpca")

ifnb_merge <- IntegrateData(anchorset = anchors)

ifnb_merge
## An object of class Seurat 
## 16053 features across 13999 samples within 2 assays 
## Active assay: integrated (2000 features, 2000 variable features)
##  1 other assay present: RNA

Evaluating RPCA using clusters

To evaluate how well the merge has worked, we check the clustering. Again we must scale, and then use our quick_clust() function.

We can now see that our two data sets overlay with each other.

ifnb_merge <- ScaleData(ifnb_merge)

ifnb_merge <- quick_clust(ifnb_merge)

DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)

Evaluating RPCA using clusters

We can now see that our two data sets overlay with each other.

Evaluating RPCA using clusters

We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now.

tbl <- table(ifnb_merge$stim, ifnb_merge$seurat_clusters)

barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")

Evaluating RPCA using cell types

We can also check the cell types. Using UMAPs we can split and compare across our conditions and cell types.

DimPlot(ifnb_merge, group.by = "seurat_annotations", split.by = "stim", pt.size = 0.2)

Evaluating RPCA using cell types

Using heatmaps we can also check how specific each cluster is to each cell type.

library(pheatmap)

tbl <- table(ifnb_merge$seurat_clusters, ifnb_merge$seurat_annotations)
pheatmap(tbl, scale = "column")

Merge data with MNN correction


Merge data with MNN

The Mutual Nearest Neighbors (MNN) approach uses paired cells, instead of anchor genes, to estimate the difference between batches.

The MNN correction was published by Haghverdi et al. (Marioni lab), Nature Biotechnology (2018). link

Steps for merging data with MNN

  1. Convert into SingleCellExperiment
  2. Identify features across samples
  3. Normalization
  4. Identify variables
  5. Merge data with MNN correction
  6. Build UMAP and clusters
  7. Evaluate with UMAP
  8. Evaluate composition of samples in each cluster

Preparing to merge data with MNN

First we have to convert the Seurat objects into SingleCellExperiment objects.

sce_list <- lapply(ifnb_list, function(seu) {
    sce <- as.SingleCellExperiment(seu, assay = "RNA")
    rowData(sce)$SYMBOL <- rownames(sce)
    return(sce)
})


sce_list
## $CTRL
## class: SingleCellExperiment 
## dim: 14053 6548 
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(6548): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTTCGC.1
##   TTTGCATGGTCCTC.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):
## 
## $STIM
## class: SingleCellExperiment 
## dim: 14053 7451 
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(7451): AAACATACCAAGCT.1 AAACATACCCCTAC.1 ... TTTGCATGCTAAGC.1
##   TTTGCATGGGACGA.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):

Preparing to merge data with MNN

As with RPCA, we need to model and identify highly variable genes. For this we will use the scran functions modelGeneVar() and getTopHVGs(). We simply provide our SingleCellExperiment objects.

library(scran)
dec_list <- lapply(sce_list, modelGeneVar)

hvgc_list <- lapply(sce_list, getTopHVGs, prop = 0.1)

Preparing to merge data with MNN

Next we will find the features that are shared between samples. We must first define the shared “universe” of genes between our samples and subset our variable genes to these. We can then combine the variances to find features that are variable in both datasets.

universe <- intersect(rownames(sce_list$CTRL), rownames(sce_list$STIM))
sce_list <- lapply(sce_list, function(sce, universe) {
    sce <- sce[universe, ]
    return(sce)
}, universe)
dec_list <- lapply(dec_list, function(dec, universe) {
    dec <- dec[universe, ]
    return(dec)
}, universe)

combined_dec <- combineVar(dec_list$CTRL, dec_list$STIM)
chosen_hvgs <- combined_dec$bio > 0
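
As a quick sanity check, chosen_hvgs is a logical vector over the shared genes; a short sketch (assuming the combined_dec and chosen_hvgs objects from above) counts how many genes will be carried into the merge:

```r
# Genes with positive combined biological variance are kept for fastMNN
sum(chosen_hvgs)
summary(combined_dec$bio)
```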

Merge data with MNN

We will use the batchelor package to run MNN with the fastMNN() function.

  • d: number of principal components to compute
  • k: number of nearest neighbors to consider
  • subset.row: subset of genes to use. Here, we are using the top variable features

library(batchelor)
mnn_res <- fastMNN(CTRL = sce_list$CTRL, STIM = sce_list$STIM, d = 50, k = 20, subset.row = chosen_hvgs)
mnn_res
## class: SingleCellExperiment 
## dim: 1958 13999 
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(1958): HES4 ISG15 ... CTD-2521M24.5 AJ006998.2
## rowData names(1): rotation
## colnames(13999): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTAAGC.1
##   TTTGCATGGGACGA.1
## colData names(1): batch
## reducedDimNames(1): corrected
## mainExpName: NULL
## altExpNames(0):

Merge data with MNN

The resulting SingleCellExperiment object contains the batch information. We can also retrieve a matrix of our dimension reduction results and corrected log counts with the reducedDim() and assay() functions.

table(mnn_res$batch)
## 
## CTRL STIM 
## 6548 7451
reducedDim(mnn_res, "corrected")[1:2, ]
##                       [,1]         [,2]        [,3]        [,4]        [,5]
## AAACATACATTTCC.1 0.4692084  0.002558405  0.24112457 -0.01932945 -0.02919753
## AAACATACCAGAAA.1 0.4849751 -0.221433858 -0.03194568 -0.04457300  0.02085705
##                           [,6]        [,7]        [,8]         [,9]       [,10]
## AAACATACATTTCC.1  0.0001116506 -0.05219881 -0.01560542 -0.018456594 -0.02042355
## AAACATACCAGAAA.1 -0.0208653692  0.02518229 -0.05640069  0.001590785  0.02042924
##                         [,11]       [,12]        [,13]       [,14]
## AAACATACATTTCC.1 -0.008021795 -0.04396845  0.005131835  0.01141103
## AAACATACCAGAAA.1 -0.006484712  0.02690261 -0.014495238 -0.01603657
##                          [,15]        [,16]       [,17]        [,18]
## AAACATACATTTCC.1 -0.0002610275 -0.001044311  0.02583134  0.004660613
## AAACATACCAGAAA.1  0.0417677345  0.002738321 -0.01539083 -0.028006127
##                         [,19]       [,20]        [,21]       [,22]       [,23]
## AAACATACATTTCC.1 -0.003453839 0.007833436  0.004925768  0.01688819 -0.01693241
## AAACATACCAGAAA.1  0.010241154 0.016709708 -0.019963263 -0.02561860  0.01056931
##                         [,24]        [,25]       [,26]        [,27]
## AAACATACATTTCC.1 -0.003615097 -0.007882785 -0.01073247 -0.007515978
## AAACATACCAGAAA.1  0.010869553 -0.005000099  0.02215598 -0.000877394
##                         [,28]       [,29]        [,30]       [,31]       [,32]
## AAACATACATTTCC.1 -0.006422341 0.004441876 -0.002957034 -0.01006521 0.002491115
## AAACATACCAGAAA.1  0.003697462 0.014406622  0.003685268  0.01067559 0.021875399
##                         [,33]        [,34]        [,35]        [,36]
## AAACATACATTTCC.1 -0.001875363 -0.008862148 -0.002032289  0.008468607
## AAACATACCAGAAA.1  0.011806194  0.003560579  0.002324063 -0.002952769
##                         [,37]        [,38]        [,39]       [,40]
## AAACATACATTTCC.1 -0.007954117 0.0009951654 -0.002792572 0.002435518
## AAACATACCAGAAA.1 -0.006034000 0.0009532853  0.002280508 0.010147309
##                         [,41]         [,42]         [,43]        [,44]
## AAACATACATTTCC.1  0.015530208 -0.0009633948  0.0026935382 0.0002299324
## AAACATACCAGAAA.1 -0.003641573  0.0034695645 -0.0008694543 0.0037271479
##                         [,45]        [,46]        [,47]        [,48]
## AAACATACATTTCC.1 -0.003499860 -0.001978646 -0.009881689  0.004610468
## AAACATACCAGAAA.1 -0.002181879 -0.001995283 -0.005362725 -0.001792342
##                         [,49]         [,50]
## AAACATACATTTCC.1 -0.001164966 -0.0095759157
## AAACATACCAGAAA.1  0.005418251  0.0003605282
assay(mnn_res, "reconstructed")[1:2, 1:5]
## <2 x 5> LowRankMatrix object of type "double":
##       AAACATACATTTCC.1 AAACATACCAGAAA.1 AAACATACCTCGCT.1 AAACATACCTGGTA.1
## HES4      -0.001084752     -0.001208987     -0.001217453      0.001368413
## ISG15     -0.112620920     -0.128283650     -0.111386942     -0.138798526
##       AAACATACGATGAA.1
## HES4      -0.001118792
## ISG15     -0.129904486

UMAP of data merged with MNN

Make UMAP using the scater package.

library(scater)
set.seed(1001)
mnn_res <- runUMAP(mnn_res, dimred = "corrected")
mnn_res$batch <- factor(mnn_res$batch)
plotUMAP(mnn_res, colour_by = "batch")

Clustering data merged with MNN

We will cluster using the SNN graph approach from scran, as this is compatible with our SingleCellExperiment object.

snn.gr <- buildSNNGraph(mnn_res, use.dimred = "corrected")
cluster_mnn <- igraph::cluster_louvain(snn.gr)$membership
table(cluster_mnn)
## cluster_mnn
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##  656 1455 1139  432  964 1052  769  786  936 2043 1299   44 2368   56

Clustering data merged with MNN

Let's check the UMAP of our clustered results.

mnn_res$cluster <- factor(cluster_mnn)
plotUMAP(mnn_res, colour_by = "cluster")

Clustering data merged with MNN

We can also check how many cells from each cluster belong to each group. With this approach you can see that there are more group-specific clusters.

tbl <- table(mnn_res$batch, mnn_res$cluster)
barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")

Evaluate with cell types

We must annotate our SingleCellExperiment object with the cell type annotations from the original objects. We can then visualize the cell types on our UMAP.

cellType <- lapply(sce_list, function(x) {
    res <- setNames(as.character(colData(x)$seurat_annotations), colnames(x))
    return(res)
})
cell_type <- c(cellType$CTRL, cellType$STIM)
mnn_res$cell_type <- cell_type[match(rownames(colData(mnn_res)), names(cell_type))]
mnn_res$cell_type <- factor(mnn_res$cell_type)

Evaluate with cell types

plotUMAP(mnn_res, colour_by = "cell_type")

Evaluate with cell types

We can also use a heatmap to look at the specificity of each cell type to each cluster. Most are cluster-specific.

tbl <- table(mnn_res$cluster, mnn_res$cell_type)
pheatmap(tbl, scale = "column")

MNN vs RPCA

Performance is dependent on experimental context.

Generally:

  • RPCA - more homogeneous
  • MNN - more heterogeneous

Merge data with Harmony


Harmony

  • Harmony is an R package for single-cell data integration link.
  • The user manual for Harmony is also available link.
  • There is also a Python implementation of this package.
  • Here, we will demonstrate how to integrate Harmony into the regular Seurat workflow.

Prepare data for Harmony

We can prepare for Harmony in much the same way as we prepare for the simple Seurat merge: merge, normalize, scale, PCA and UMAP.

seu_obj <- merge(ifnb_list$CTRL, ifnb_list$STIM)
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)
seu_obj <- RunPCA(seu_obj)
seu_obj <- RunUMAP(seu_obj, reduction = "pca", dims = 1:10, reduction.name = "umap")

Prepare data for Harmony

As you can see, we are back with our completely separate groups.

DimPlot(seu_obj)

Merge data with Harmony

We can use the RunHarmony() function to implement the Harmony correction.

library(harmony)
seu_obj <- RunHarmony(seu_obj, group.by.vars = "stim", assay.use = "RNA")
seu_obj <- RunUMAP(seu_obj, reduction = "harmony", dims = 1:10, reduction.name = "umap_harmony")
DimPlot(seu_obj, reduction = "umap_harmony")

An advanced scRNAseq workflow

overview

Cell type annotation


Cell type annotation

To annotate single-cell datasets, we can evaluate the gene expression pattern of well known cell-type-specific marker genes and make a manual annotation, as we did in section II. Here, we will introduce two more strategies to make cell type annotations automatically:

  1. Mapping and annotating query datasets with Seurat, using a reference dataset. link
  2. Making annotations with SingleR link

Prepare example datasets

The demo data, panc8, can be fetched using SeuratData. It contains single-cell RNA-Seq of human pancreatic islets sequenced with different techniques.

The demo data contains:

  • CELSeq dataset 1 (GSE81076), as reference
  • CELSeq dataset 2 (GSE85241), as reference
  • SMART-Seq2 (E-MTAB-5061), as query
  • Fluidigm C1 (GSE86469)

As we did before we will split this dataset. We have also prepped this dataset for you, if you want to load the data object in directly.

library(Seurat)
library(SeuratData)
InstallData("panc8")
data("panc8")

Prepare example datasets

Split up the datasets by techniques using SplitObject().

seu_list <- SplitObject(panc8, split.by = "tech")
names(seu_list) <- c("celseq1", "celseq2", "smartseq", "fluidigmc1", "indrop")
load("data/panc8.RData")

Prepare reference dataset

For this we need a reference dataset and a query dataset. We will merge the celseq1 and celseq2 data with RPCA to make the reference.

seu_list <- lapply(seu_list, data_proc)
ref_list <- seu_list[c("celseq1", "celseq2")]
feats <- SelectIntegrationFeatures(ref_list)
ref_list <- lapply(ref_list, function(seu, feats) {
    seu <- ScaleData(seu, features = feats, verbose = FALSE)
    seu <- RunPCA(seu, features = feats, verbose = FALSE)
    return(seu)
}, feats)

Prepare reference dataset

Merge celseq1 and celseq2 data with RPCA

anchors <- FindIntegrationAnchors(ref_list, anchor.features = feats, reduction = "rpca")
ref_panc <- IntegrateData(anchorset = anchors)
ref_panc <- ScaleData(ref_panc)
ref_panc <- quick_clust(ref_panc)

Prepare reference dataset

How does the data look?

DimPlot(ref_panc, group.by = "seurat_clusters", split.by = "tech", pt.size = 0.2,
    label = TRUE)

Reference and query datasets

panc_list <- list(ref = ref_panc, query = seu_list$smartseq)
feats <- SelectIntegrationFeatures(panc_list)
panc_list <- lapply(panc_list, function(seu, feats) {
    seu <- ScaleData(seu, features = feats, verbose = FALSE)
    seu <- RunPCA(seu, features = feats, verbose = FALSE)
    return(seu)
}, feats)

Predict cell types for query

Identify the anchors between the reference and query datasets using FindTransferAnchors(). These are essential for transferring information from our reference to our query. We can then transfer the cell type information: for each cell in the query dataset, a score for each cell type is estimated from the gene expression pattern of the anchor genes, using the TransferData() function.

anchors <- FindTransferAnchors(reference = panc_list$ref, query = panc_list$query,
    dims = 1:30, reference.reduction = "pca")
pred_res <- TransferData(anchorset = anchors, refdata = panc_list$ref$celltype)
head(pred_res, 2)
##       predicted.id prediction.score.gamma prediction.score.acinar
## AZ_A2        gamma            0.988556270                       0
## AZ_B9        alpha            0.004147972                       0
##       prediction.score.alpha prediction.score.delta prediction.score.beta
## AZ_A2             0.01144373            0.000000000              0.000000
## AZ_B9             0.85757406            0.009551014              0.128727
##       prediction.score.ductal prediction.score.endothelial
## AZ_A2                       0                            0
## AZ_B9                       0                            0
##       prediction.score.activated_stellate prediction.score.schwann
## AZ_A2                                   0                        0
## AZ_B9                                   0                        0
##       prediction.score.mast prediction.score.macrophage
## AZ_A2                     0                           0
## AZ_B9                     0                           0
##       prediction.score.epsilon prediction.score.quiescent_stellate
## AZ_A2                        0                                   0
## AZ_B9                        0                                   0
##       prediction.score.max
## AZ_A2            0.9885563
## AZ_B9            0.8575741

Predict cell types for query

The cell type with the highest score is assigned to each cell. We can visualize these scores with a heatmap.

mat <- as.matrix(pred_res[, -c(1, 15)])
colnames(mat) <- gsub("prediction.score.", "", colnames(mat))
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)
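
The assignment rule itself can be reproduced from the score columns; a minimal sketch, assuming pred_res from TransferData() above and the prediction.score.* column naming shown earlier:

```r
# Hypothetical illustration of "highest score wins": drop the id and
# prediction.score.max columns, then take the column with the maximum score
score_cols <- setdiff(grep("^prediction\\.score\\.", colnames(pred_res), value = TRUE),
    "prediction.score.max")
manual_id <- sub("prediction.score.", "", score_cols,
    fixed = TRUE)[max.col(as.matrix(pred_res[, score_cols]))]
# manual_id should agree with pred_res$predicted.id
```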

Load cell types into query

pred_cellType <- setNames(pred_res$predicted.id, rownames(pred_res))
panc_list$query[["cellType_predBySeurat"]] <- pred_cellType[match(Cells(panc_list$query),
    names(pred_cellType))]

head(panc_list$query, 2)
##       orig.ident nCount_RNA nFeature_RNA      tech replicate assigned_cluster
## AZ_A2         AZ     384801         3611 smartseq2 smartseq2             <NA>
## AZ_B9         AZ     654549         4433 smartseq2 smartseq2             <NA>
##       celltype   dataset cellType_predBySeurat
## AZ_A2    gamma smartseq2                 gamma
## AZ_B9    alpha smartseq2                 alpha
table(panc_list$query$cellType_predBySeurat)
## 
##             acinar activated_stellate              alpha               beta 
##                192                 55               1016                320 
##              delta             ductal        endothelial              gamma 
##                125                440                 21                204 
##         macrophage               mast quiescent_stellate            schwann 
##                  7                  7                  5                  2

Annotation with SingleR

SingleR is a Bioconductor package which can be used to predict annotations for a single-cell dataset.

This package uses SingleCellExperiment objects, so we need to convert our Seurat object.

sce_list <- lapply(panc_list, function(seu) {
    sce <- as.SingleCellExperiment(seu, assay = "RNA")
    return(sce)
})

Annotation with SingleR

A score is generated by comparing the expression levels of each cell in the query dataset to the expression pattern of each group (e.g. cell type) in the reference dataset. A cell is assigned the cell type with the highest score.

library(SingleR)
pred_res <- SingleR(ref = sce_list$ref, test = sce_list$query, labels = sce_list$ref$celltype)
head(pred_res, 2)
## DataFrame with 2 rows and 4 columns
##                               scores      labels delta.next pruned.labels
##                             <matrix> <character>  <numeric>   <character>
## AZ_A2 0.300904:0.178568:0.436055:...       gamma  0.1394303            NA
## AZ_B9 0.318227:0.227997:0.529730:...       alpha  0.0861217         alpha

Annotation with SingleR

By converting to a matrix, we can check the cell type scoring using a heatmap.

mat <- as.matrix(pred_res$scores)
rownames(mat) <- rownames(pred_res)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)

Import SingleR annotation

Lastly we want to add our annotation back to our query dataset.

cell_type <- setNames(pred_res$pruned.labels, rownames(pred_res))
panc_list$query$cellType_predBySingleR <- cell_type[match(Cells(panc_list$query),
    names(cell_type))]
head(panc_list$query, 2)
##       orig.ident nCount_RNA nFeature_RNA      tech replicate assigned_cluster
## AZ_A2         AZ     384801         3611 smartseq2 smartseq2             <NA>
## AZ_B9         AZ     654549         4433 smartseq2 smartseq2             <NA>
##       celltype   dataset cellType_predBySeurat cellType_predBySingleR
## AZ_A2    gamma smartseq2                 gamma                   <NA>
## AZ_B9    alpha smartseq2                 alpha                  alpha

Seurat vs SingleR annotation

First we can compare these predictions back to the original annotation and check how many match.

Seurat annotation:

table(panc_list$query$cellType_predBySeurat == panc_list$query$celltype)
## 
## FALSE  TRUE 
##    37  2357

SingleR annotation:

table(panc_list$query$cellType_predBySingleR == panc_list$query$celltype)
## 
## FALSE  TRUE 
##    35  2299

Seurat vs SingleR annotation

Next we can compare our two annotations:

table(panc_list$query$cellType_predBySeurat == panc_list$query$cellType_predBySingleR)
## 
## FALSE  TRUE 
##    39  2295
tbl <- table(panc_list$query$cellType_predBySeurat, panc_list$query$cellType_predBySingleR)
pheatmap::pheatmap(tbl, scale = "row")

Seurat vs SingleR annotation

The high agreement seen here isn’t always the case; this is well-behaved demonstration data.

  • Seurat: Wins on speed
  • SingleR: More reliable and consistent

An advanced scRNAseq workflow

overview

Droplet processing


Droplet processing

Often empty droplets can still make their way past Cell Ranger, or we may want to do custom filtering outside of Cell Ranger.

Ambient RNAs from lysed cells can contaminate droplets and make empty droplets seem like they contain a cell.

We are revisiting the PBMC 10k data from 10X Genomics as an example to deal with these issues. This time we will use the raw matrix.

Detect TRUE cells

  • The knee plot is available in the Web Summary from Cell Ranger.
  • We can also draw the knee plot ourselves using DropletUtils.

Grab dataset

download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz",
    "10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")

Load dataset

We need to load in the dataset, but it is important not to set any filters yet.

library(DropletUtils)
library(Seurat)

raw_mtx <- Seurat::Read10X("raw_feature_bc_matrix")
dim(raw_mtx)
## [1]   36601 2099284

Load dataset

Calculate the counts for each cell barcode and their rank.

bcrank <- barcodeRanks(raw_mtx)
head(bcrank, 2)
## DataFrame with 2 rows and 3 columns
##                         rank     total    fitted
##                    <numeric> <integer> <numeric>
## AAACCCAAGAAACCAT-1   1752547         0        NA
## AAACCCAAGAAACCCA-1    914872         1        NA

Draw Knee plot

To draw a custom knee plot we need to remove duplicated ranks and grab the knee and inflection points.

uniq <- !duplicated(bcrank$rank)
bcrank <- bcrank[uniq, ]

knee <- metadata(bcrank)$knee
message(paste0("knee point: ", knee))
inflection <- metadata(bcrank)$inflection
message(paste0("inflection point: ", inflection))

Draw Knee plot

We can now draw our knee plot using ggplot().

library(ggplot2)

ggplot(as.data.frame(bcrank), aes(x = rank, y = total)) + geom_point() + geom_hline(yintercept = knee,
    linetype = "dashed", color = "blue") + geom_hline(yintercept = inflection, linetype = "dashed",
    color = "green") + scale_x_continuous(trans = "log10") + scale_y_continuous(trans = "log10") +
    labs(x = "cell barcode ranked by counts", y = "UMI counts of each cell barcode") +
    theme_classic()

Draw Knee plot

We can now draw our knee plot using ggplot(). The knee is labelled in blue. The inflection is labelled in green.

Detect empty droplets

We can detect empty droplets with DropletUtils::emptyDrops(). A p-value for each barcode is computed by Monte Carlo simulation, based on the deviation of that barcode’s expression profile from the ambient RNA pool.
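
The e.out object used on the following slides can be produced along these lines; a minimal sketch, assuming raw_mtx from the earlier slide (the lower value shown is the function's illustrative default, not a tuned choice):

```r
# Monte Carlo p-values are stochastic, so set a seed for reproducibility
set.seed(100)

# Test each barcode against the ambient profile; barcodes with fewer than
# lower = 100 total UMIs are treated as definitely empty by default
e.out <- emptyDrops(raw_mtx, lower = 100)
head(e.out, 2)
```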

Detect empty droplets

The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.

ggplot(as.data.frame(e.out), aes(x = Total)) + geom_histogram() + geom_vline(xintercept = knee,
    color = "red", linetype = "dashed") + labs(x = "UMI counts per cell", y = "Frequency") +
    scale_y_continuous(trans = "log10") + scale_x_continuous(trans = "log10") + theme_classic()

Detect empty droplets

The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.

Filter empty droplets

In the DropletUtils manual, the FDR threshold is set to 0.001 in most cases. We can filter based on this cutoff.

table(e.out$FDR < 0.001)
## 
## FALSE  TRUE 
##   611 11574
cell_sel <- rownames(e.out)[e.out$FDR < 0.001]
filt_mtx <- raw_mtx[, colnames(raw_mtx) %in% cell_sel]

Ambient RNAs

  • Cell-free RNA molecules spread randomly in the solution and can be enclosed in droplets during library preparation, falsely increasing the UMI counts of genes.

  • Ambient RNAs present two signatures:

    • The majority are found in the empty-droplet population.
    • No specificity. This contamination can make a single cell look like it expresses two (or more) mutually exclusive marker genes.
  • Based on these two properties, we can estimate the ambient RNA contamination rate and correct for it. Here we will demonstrate how to correct ambient RNA contamination with DropletUtils and SoupX, respectively.

Correct ambient RNAs

Let's start by estimating our ambient RNA contamination. This uses DropletUtils.

amb <- metadata(e.out)$ambient[, 1]
head(amb)
##   AL627309.1   AL627309.3   AL627309.5   AL627309.4   AL669831.2    LINC01409 
## 8.369426e-07 9.859354e-08 5.635719e-06 9.859354e-08 9.859354e-08 6.511074e-06

Correct ambient RNAs

We can read back in our filtered PBMC data.

download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz",
    "10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
filt_mtx <- Seurat::Read10X("filtered_feature_bc_matrix/")

Correct ambient RNAs

Then subset to the genes with ambient estimates and process the data.

filt_mtx_drop <- filt_mtx[rownames(filt_mtx) %in% names(amb), ]
seu <- CreateSeuratObject(filt_mtx_drop)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
sce <- as.SingleCellExperiment(seu, assay = "RNA")

Correct ambient RNAs

plotUMAP(sce, colour_by = "seurat_clusters")

Correct ambient RNAs

Let's put our uncorrected, filtered dataset in a list.

seu_list <- list()
seu_list[["woCorr"]] <- seu

Correct ambient RNAs

We can use the removeAmbience() function to remove ambient RNA counts.

out <- removeAmbience(counts(sce), ambient = amb, groups = sce$seurat_clusters)
rownames(out) <- rownames(sce)
colnames(out) <- colnames(sce)

Correct ambient RNAs

Now that we have run a correction, we want to reprocess and cluster our data.

seu_list[["withCorr"]] <- CreateSeuratObject(out)
seu_list[["withCorr"]] <- data_proc(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- ScaleData(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- quick_clust(seu_list[["withCorr"]])

Evaluate with marker genes

Let's compare results using our marker genes.

mark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY", "NKG7",
    "PPBP")
mark_gene
## [1] "CCR7"   "CD8A"   "MS4A1"  "CD14"   "FCGR3A" "FCER1A" "GNLY"   "NKG7"  
## [9] "PPBP"

Evaluate with marker genes

First let's look at the UMAP without ambient RNA correction.

DimPlot(seu_list$woCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

Evaluate with marker genes

Let's now compare to the UMAP with ambient RNA correction.

DimPlot(seu_list$withCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

Evaluate with marker genes

Let's now look at violin plots of the uncorrected data.

VlnPlot(seu_list$woCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

Evaluate with marker genes

And then again in the corrected.

VlnPlot(seu_list$withCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

Ambient RNA and SoupX

SoupX is an alternative tool for droplet processing, developed for ambient RNA detection and correction. It includes the following steps:

  • Estimate ambient RNA expression from the empty droplets.
  • Estimate the contamination rate using the ambient RNA expression levels across clusters.
  • Correct the ambient RNA contamination.
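The intuition behind the correction step can be sketched with a back-of-envelope example. This is not SoupX's adjustCounts() algorithm, which is more sophisticated; it just shows the core idea: given a contamination fraction (rho) and an ambient expression profile, subtract the expected "soup" counts from each cell. All numbers here are invented.

```r
# Toy illustration of the idea behind ambient RNA correction (SoupX's
# adjustCounts() is more sophisticated than this simple subtraction).
rho <- 0.1                                              # assumed contamination fraction
cell_counts  <- c(CD14 = 40, MS4A1 = 3, PPBP = 1)       # observed counts for one cell
ambient_prop <- c(CD14 = 0.5, MS4A1 = 0.3, PPBP = 0.2)  # ambient profile
soup_expected <- rho * sum(cell_counts) * ambient_prop  # expected soup counts
corrected <- pmax(round(cell_counts - soup_expected), 0)
corrected
##  CD14 MS4A1  PPBP
##    38     2     0
```

Note how the spurious platelet gene (PPBP) drops to zero after correction, while the genuinely expressed CD14 is barely affected.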

Prepare essential files

We need our filtered and unfiltered matrices, plus our processed, filtered Seurat object.

clust <- setNames(seu$seurat_clusters, Cells(seu))
seu_filt <- seu

Import Matrices

Import our matrices into a SoupX SoupChannel object.

library(SoupX)
soupOBJ <- SoupChannel(raw_mtx, filt_mtx)
soupOBJ <- setClusters(soupOBJ, clust)

Automatic estimates

The autoEstCont() function can be used to estimate the contamination fraction automatically. We can then export the corrected counts with adjustCounts().

soupOBJ <- autoEstCont(soupOBJ)

autoCor_mtx <- adjustCounts(soupOBJ)
load("data/soup.RData")
load("data/auto_soup.RData")

Build a Seurat object

We will now put our corrected matrix into a Seurat object.

seu <- CreateSeuratObject(autoCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_autoCorr <- seu

Estimate contamination

We can estimate contamination using known marker genes.

mark_list <- list(`CD4 T-cell` = c("IL7R", "CCR7", "S100A4"), `CD8 T-cell` = c("CD8A"),
    `B-Cell` = c("MS4A1"), Monocyte = c("CD14", "FCGR3A"), DC = c("FCER1A"), NK = c("NKG7",
        "GNLY"), Platelet = c("PPBP"))

use_toEst <- estimateNonExpressingCells(soupOBJ, nonExpressedGeneList = mark_list)
soupOBJ <- calculateContaminationFraction(soupOBJ, mark_list, useToEst = use_toEst)
rho <- unique(soupOBJ$metaData$rho)
rho
## [1] 0.0193281

Export matrix after correction

soupOBJ <- setContaminationFraction(soupOBJ, rho, forceAccept = TRUE)
estCor_mtx <- adjustCounts(soupOBJ)

Add exported matrix to Seurat

seu <- CreateSeuratObject(estCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_estCorr <- seu

Evaluate marker genes

mark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY", "NKG7",
    "PPBP")
mark_gene
## [1] "CCR7"   "CD8A"   "MS4A1"  "CD14"   "FCGR3A" "FCER1A" "GNLY"   "NKG7"  
## [9] "PPBP"

Without Correction

DimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()

Without Correction

VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

With Correction (auto)

DimPlot(seu_autoCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

With Correction (auto)

VlnPlot(seu_autoCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

With Correction (est)

DimPlot(seu_estCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

With Correction (est)

VlnPlot(seu_estCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

CellBender

CellBender is a python toolkit developed to eliminate technical artifacts from scRNA-Seq data such as empty droplets and ambient RNAs. It is run from the command line.

In the current version, it contains the remove-background module, which is used to:

  • Detect empty droplets and ambient RNAs from the raw count matrix in the Cell Ranger output
  • Remove empty droplets and correct for the interference of ambient RNAs

  • CellBender is based on a machine-learning strategy and is quite time-consuming. We can speed up processing by using the CUDA version of CellBender.

Command line of CellBender

input_h5=the_raw_matrix_in_h5_format_from_cellranger # essential
output_h5=assign_the_h5_file_path_for_the_cellbender_corrected_matrix # essential
expect_cell=expected_cell_number_can_be_found_in_cellranger_Web_Summary # essential
droplet_num=the_total_number_of_droplets_assigned_while_sequencing # default 25,000
fpr=threshold_of_FALSE_POSITIVE_RATE # default 0.01
epochs=number_of_epochs_to_train # default 150
num_train=number_of_times_to_attempt_to_train_the_model # default 1
#
cellbender remove-background --input $input_h5 --output $output_h5 --expected-cells $expect_cell --total-droplets-included $droplet_num --fpr $fpr --epochs $epochs --num-training-tries $num_train # add --cuda to enable GPU acceleration

Cell Ranger vs. CellBender

We have some processed results here from CellBender. We can compare these to our filtered matrix from Cell Ranger.

cbFilt_mtx <- Read10X_h5("data/cbFilt_PBMCv3_20230324_filtered.h5")
dim(cbFilt_mtx)
## [1] 36601 12244
dim(filt_mtx)
## [1] 36601 11485

Data processing

message("processing matrix from CellRanger")
seu <- CreateSeuratObject(filt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_filt <- seu

message("processing matrix from CellBender")
seu <- CreateSeuratObject(cbFilt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_cbFilt <- seu

Evaluate with marker gene expression

mark_gene <- c("CD3E", "CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY",
    "PPBP")
mark_gene
## [1] "CD3E"   "CCR7"   "CD8A"   "MS4A1"  "CD14"   "FCGR3A" "FCER1A" "GNLY"  
## [9] "PPBP"

Evaluate with marker gene expression

DimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()

Evaluate with marker gene expression

DimPlot(seu_cbFilt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

Evaluate with marker gene expression

VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

Evaluate with marker gene expression

VlnPlot(seu_cbFilt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

An advanced scRNAseq workflow

overview

Trajectory and Pseudotime analysis


Trajectory and pseudotime analysis

  • Many biological processes are linked to dynamic changes in cellular state, e.g. immune cell activation.
  • Trajectory analysis is a way to infer these kinds of dynamic changes from single-cell data.
  • From the inferred trajectories, we can also estimate the pseudotime of each cell included in a particular trajectory.
  • Here, we demonstrate how to identify trajectories reflecting hematopoietic stem cell differentiation.
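As a conceptual toy (nothing to do with slingshot's actual algorithm), pseudotime can be thought of as each cell's position along a curve fitted through the embedding. Here we fake a 1-D path of cells in 2-D space and use the first principal component as a crude stand-in for that curve; all data are simulated.

```r
# Conceptual toy: cells lying along a 1-D path; their projection onto the
# main axis of variation gives an ordering, i.e. a crude "pseudotime".
set.seed(42)
emb <- cbind(x = sort(rnorm(50)), y = rnorm(50, sd = 0.1))  # fake embedding
pc1 <- prcomp(emb)$x[, 1]   # position of each cell along the principal axis
pseudotime <- rank(pc1)     # rank along the axis = toy pseudotime
head(pseudotime)
```

Real trajectory methods replace the straight principal axis with principal curves that can bend and branch, which is what slingshot does below.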

Prepare the demo

  • In this practice, we are using the SMART-Seq data of mouse hematopoietic stem cells (HSCs) published in Nestorowa et al., Blood (2016).
  • We will provide the SingleCellExperiment object.
  • The SingleCellExperiment (sce) object was downloaded and processed by following this vignette.
sce <- readRDS("data/sceOBJ_Nestorowa_HSC.rds")
sce
## class: SingleCellExperiment 
## dim: 46078 1656 
## metadata(0):
## assays(2): counts logcounts
## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000107391 ENSMUSG00000107392
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810
## colData names(5): cell.type FACS sizeFactor label cellType
## reducedDimNames(3): diffusion PCA UMAP
## mainExpName: endogenous
## altExpNames(1): ERCC

Estimate trajectories by fitting principal curves

To find the trajectories, we need to fit curves to our high-dimensional data. This step is performed with the slingshot package.

To run slingshot we need:

  • Our sce object: sce
  • Clusters: sce$label
  • Our choice of dimension reduction approach: PCA
  • To reduce computational load, each curve is approximated by 100 points (approx_points = 100)
  • To avoid connecting unrelated trajectories, an artificial "omega" cluster is introduced (omega = TRUE)

library(slingshot)
sce.sling <- slingshot(sce, cluster = sce$label, reducedDim = "PCA", approx_points = 100,
    omega = TRUE)

Estimate trajectories by fitting principal curves

colData(sce.sling)[1, 1:5]
## DataFrame with 1 row and 5 columns
##                     cell.type                        FACS sizeFactor    label
##                   <lgCMatrix>                    <matrix>  <numeric> <factor>
## HSPC_025 FALSE:FALSE:TRUE:... 27675.2:1.17991:1.12999:...   0.499986        3
##              cellType
##           <character>
## HSPC_025 Erythrocytes

Extract information for each trajectory

Slingshot has fitted principal curves. We will now embed these curves in UMAP space with embedCurves(). We can see the number of lineages detected at this point.

embedded <- embedCurves(sce.sling, "UMAP")

embedded@metadata$lineages
## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
## 
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
## 
## $Lineage3
## [1] "7"

Estimate pseudotime

Once we have lineages, we can then find the position of each cell along these trajectories. This is estimating the pseudotime.

pseudo.paths <- slingPseudotime(sce.sling)
head(pseudo.paths, 2)
##           Lineage1 Lineage2 Lineage3
## HSPC_025 111.82993       NA       NA
## HSPC_031  96.15939 99.77961       NA

It is useful to combine these times by averaging across lineages to obtain a shared pseudotime.

avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling)$avg_pseudoTime <- avg_pseudoTime

Estimate pseudotime

Once we have the average pseudotime, we can project these values onto our UMAP.

plotUMAP(sce.sling, colour_by = "avg_pseudoTime")

Plotting principal curves

The slingCurves() function fetches the essential information for each principal curve. The results are a list, with one element per curve. For each principal curve, we can extract the UMAP coordinates and the order of cells for plotting.

embedded_curve <- slingCurves(embedded)

curve <- lapply(embedded_curve, function(x) {
    dat <- data.frame(x$s[x$ord, ])  # UMAP coordinates, ordered along the curve
    return(dat)
})
names(curve) <- c("curve1", "curve2", "curve3")

head(curve$curve1)
##       UMAP1     UMAP2
## 1 -12.23250 0.6437628
## 2 -12.01319 0.7200463
## 3 -11.79388 0.7963086
## 4 -11.57455 0.8725464
## 5 -11.35521 0.9487473
## 6 -11.13585 1.0248888

Plotting principal curves

We can now add our curve information directly to our pseudotime UMAP plot using ggplot2's geom_path().

plotUMAP(sce.sling, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
    aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
    y = UMAP2), color = "black") + geom_path(data = curve$curve3, aes(x = UMAP1,
    y = UMAP2), color = "red")

Refine the estimations

The results indicate that lineage 3 is an outlier. If we look back at our lineage information, lineage 3 contains only cluster 7.

embedded@metadata$lineages
## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
## 
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
## 
## $Lineage3
## [1] "7"

Refine the estimations

We can remove cluster 7 and re-evaluate trajectories again. First we rerun the pseudotime estimation.

sce2 <- sce[, colData(sce)$label != 7]
sce.sling2 <- slingshot(sce2, cluster = sce2$label, reducedDim = "PCA", approx_points = 100,
    omega = TRUE)
pseudo.paths <- slingPseudotime(sce.sling2)
avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling2)$avg_pseudoTime <- avg_pseudoTime

Refine the estimations

Next we extract the principal curves again.

embedded <- embedCurves(sce.sling2, "UMAP")
embedded_curve <- slingCurves(embedded)
curve <- lapply(embedded_curve, function(x) {
    dat <- data.frame(x$s[x$ord, ])  # UMAP coordinates, ordered along the curve
    return(dat)
})
names(curve) <- c("curve1", "curve2")

Refine the estimations

plotUMAP(sce.sling2, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
    aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
    y = UMAP2), color = "black")

Trajectory and pseudotime for each lineage

So far we have visualized the lineages together. We will want to break this down by lineage and compare to the clusters. First we will look at the clusters.

plotUMAP(sce.sling2, colour_by = "label")

Trajectory and pseudotime for each lineage

We can see which clusters belong to lineage 1 and overlay its trajectory.

embedded@metadata$lineages$Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_1") + geom_path(data = curve$curve1,
    aes(x = UMAP1, y = UMAP2))

Trajectory and pseudotime for each lineage

We can see which clusters belong to lineage 2 and overlay its trajectory.

embedded@metadata$lineages$Lineage2
## [1] "8" "5" "9" "1" "2" "6"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_2") + geom_path(data = curve$curve2,
    aes(x = UMAP1, y = UMAP2))

Identify driver genes for each lineage

Driver genes are genes whose expression changes track the changes in pseudotime. We can find these using testPseudotime() from TSCAN. Here, gene expression is fitted against pseudotime with a (possibly non-linear) model, followed by an ANOVA test.

We will start by just looking at lineage 1.

library(TSCAN)
res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_1)

Identify driver genes for each lineage

We can see the results. Often we will want to reorder them to prioritize the biggest changes.

res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
head(res)
## DataFrame with 6 rows and 4 columns
##                        logFC      p.value          FDR             SYMBOL
##                    <numeric>    <numeric>    <numeric>        <character>
## ENSMUSG00000021728 0.0678996  0.00000e+00  0.00000e+00 ENSMUSG00000021728
## ENSMUSG00000016494 0.0654681  0.00000e+00  0.00000e+00 ENSMUSG00000016494
## ENSMUSG00000040747 0.0639249 7.23447e-296 5.16939e-293 ENSMUSG00000040747
## ENSMUSG00000057729 0.0619007  0.00000e+00  0.00000e+00 ENSMUSG00000057729
## ENSMUSG00000021998 0.0610830  0.00000e+00  0.00000e+00 ENSMUSG00000021998
## ENSMUSG00000090164 0.0608169 4.81695e-237 1.76510e-234 ENSMUSG00000090164

Identify driver genes for each lineage

We can then categorize our data based on significance and FC.

  • Recommended FDR threshold: 0.05
  • The logFC reflects the trend of gene expression along pseudotime.
    • logFC > 0: gene expression increases with pseudotime [related to the late stage]
    • logFC < 0: gene expression decreases with pseudotime [related to the early stage]
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"
res[1:2, ]
## DataFrame with 2 rows and 5 columns
##                        logFC   p.value       FDR             SYMBOL        stat
##                    <numeric> <numeric> <numeric>        <character> <character>
## ENSMUSG00000021728 0.0678996         0         0 ENSMUSG00000021728        late
## ENSMUSG00000016494 0.0654681         0         0 ENSMUSG00000016494        late

Identify driver genes for each lineage

We can quickly and easily see how many are upregulated/downregulated across the trajectory.

table(res$stat)
## 
## early  late    nc 
## 11261  8848 25969

Top driver genes

Using the order we can easily isolate the driver genes. Here we can see the top 5 early genes.

sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
top
## [1] "ENSMUSG00000054191" "ENSMUSG00000004655" "ENSMUSG00000027556"
## [4] "ENSMUSG00000031877" "ENSMUSG00000031162"

Plotting top driver genes

With plotExpression() we can look at the expression of specific genes, i.e. our top genes. We can see how each gene is expressed across all cells and pseudotime.

plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
    colour_by = "label")

Plotting top driver genes

We can now repeat this for the top 5 late genes.

sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
    colour_by = "label")

Identify driver genes

Now let's repeat this whole process for lineage 2.

res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_2)
res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"

table(res$stat)
## 
## early  late    nc 
## 11566  8450 26062

Plotting top driver genes

sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_2",
    colour_by = "label")

Plotting top driver genes

sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_2",
    colour_by = "label")

CITE-Seq


CITE-Seq

The CITE-Seq method labels different types of cells or different samples with hashtag-labeled antibodies. The cells can then be pooled into a single library and sequenced at the same time. After sequencing, the cells can be separated by their antibody hashtags.

  • Here, we will use a dataset from a Seurat vignette

Load data

rna.mat <- readRDS("data/pbmc_umi_mtx.rds")
dim(rna.mat)
## [1] 27117 16916
hto.mat <- readRDS("data/pbmc_hto_mtx.rds")
dim(hto.mat)
## [1]     8 16916
rownames(hto.mat)
## [1] "HTO_A" "HTO_B" "HTO_C" "HTO_D" "HTO_E" "HTO_F" "HTO_G" "HTO_H"

Prepare data

seu_obj <- CreateSeuratObject(counts = rna.mat, project = "citeSeq_demo")
seu_obj[["HTO"]] <- CreateAssayObject(counts = hto.mat)
seu_obj
## An object of class Seurat 
## 27125 features across 16916 samples within 2 assays 
## Active assay: RNA (27117 features, 0 variable features)
##  1 other assay present: HTO

Cluster with regular workflow

DefaultAssay(seu_obj) <- "RNA"
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)

Cluster with regular workflow

seu_obj <- quick_clust(seu_obj)
DimPlot(seu_obj, group.by = "seurat_clusters", pt.size = 0.2, label = TRUE) + NoLegend()

Differentiate Hashtags

  • Normalize data with centered log ratio (CLR)
  • Demultiplex HTOs with HTODemux()
  • Threshold for positive call: 0.99 quantile
seu_obj <- NormalizeData(seu_obj, assay = "HTO", normalization.method = "CLR")
seu_obj <- HTODemux(seu_obj, assay = "HTO", positive.quantile = 0.99)
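To see what CLR normalization is doing, here is a toy one-cell example using a common formulation of CLR on log1p counts (Seurat's internal implementation may differ in detail); the counts are invented:

```r
# Toy CLR (centered log-ratio) for one cell's HTO counts, using the common
# formulation CLR(x_i) = log1p(x_i) - mean(log1p(x)). A strongly dominant
# hashtag stands out clearly after normalization.
hto_counts <- c(HTO_A = 500, HTO_B = 10, HTO_C = 8, HTO_D = 12)
clr <- log1p(hto_counts) - mean(log1p(hto_counts))
names(which.max(clr))   # the dominant hashtag for this cell
## [1] "HTO_A"
```

HTODemux() builds on this kind of normalized signal, fitting a background distribution per hashtag and calling a cell positive for every hashtag above the positive.quantile threshold (cells positive for more than one hashtag are doublets).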

head(seu_obj, 2)
##                    orig.ident nCount_RNA nFeature_RNA nCount_HTO nFeature_HTO
## AGGCCACAGCGTCTAT citeSeq_demo        273          210       4200            7
## ATTGGTGAGTTCGCAT citeSeq_demo        305          174       3475            8
##                  RNA_snn_res.0.5 seurat_clusters HTO_maxID HTO_secondID
## AGGCCACAGCGTCTAT               1               1     HTO-H        HTO-E
## ATTGGTGAGTTCGCAT               0               0     HTO-H        HTO-G
##                  HTO_margin HTO_classification HTO_classification.global
## AGGCCACAGCGTCTAT 4.71133972              HTO-H                   Singlet
## ATTGGTGAGTTCGCAT 0.03995001        HTO-G_HTO-H                   Doublet
##                  hash.ID
## AGGCCACAGCGTCTAT   HTO-H
## ATTGGTGAGTTCGCAT Doublet

Differentiate Hashtags

table(seu_obj$HTO_classification.global)
## 
##  Doublet Negative  Singlet 
##     2598      346    13972
table(seu_obj$hash.ID)
## 
##  Doublet    HTO-H    HTO-D    HTO-E    HTO-G    HTO-F    HTO-B    HTO-C 
##     2598     1808     1716     1487     1660     1520     1993     1873 
##    HTO-A Negative 
##     1915      346

Evaluate HTOs

RidgePlot(seu_obj, group.by = "hash.ID", assay = "HTO", features = rownames(seu_obj[["HTO"]])[1:2],
    ncol = 2)

Evaluate HTOs

RidgePlot(seu_obj, group.by = "hash.ID", assay = "HTO", features = rownames(seu_obj[["HTO"]])[3:4],
    ncol = 2)